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Abstract 


The recent explosion in the amount and dimensionality of data has exacerbated the 
need of trading off computational and statistical efficiency carefully, so that inference 
is both tractable and meaningful. We propose a framework that provides an explicit 
opportunity for practitioners to specify how much statistical risk they are willing to 
accept for a given computational cost, and leads to a theoretical risk-computation 
frontier for any given inference problem. We illustrate the tradeoff between risk and 
computation and illustrate the frontier in three distinct settings. First, we derive an¬ 
alytic forms for the risk of estimating parameters in the classical setting of estimating 
the mean and variance for normally distributed data and for the more general setting 
of parameters of an exponential family. The second example concentrates on compu¬ 
tationally constrained Hodges-Lehmann estimators. We conclude with an evaluation 
of risk associated with early termination of iterative matrix inversion algorithms in the 
context of linear regression. 
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1 Introduction 


The advent of massive datasets in applied fields has been heralded as a new age for statistics 
but these datasets are both a blessing and a curse. They offer the opportunity to im¬ 
prove the precision and efficiency of statistical methods but frequently these improvements 
come at high computational costs. Up until recently, the computational aspects of statistics 
had been largely ignored by the statistics literature with those concerns relegated to other 
helds. Statisticians have begun to explore computationally more efficient techniques for dis¬ 
parate problems using ideas like variational methods (Wainwright and Jordan, 2008), parallel 
MCMC (Scott et al, 2013), stochastic gradient descent (Langford et al, 2009; Bottou, 2012; 
Agarwal et al, 2014; Toulis and Airoldi, 2014), and convex relaxations (Chandrasekaran and 
Jordan, 2013). 

Recent efforts have begun to evaluate the computational cost of many statistical proce¬ 
dures and conversely the statistical risk of computationally efficient procedures while intro¬ 
ducing new procedures that balance these ideas. Kleiner et al (2014) describes a procedure 
known as the “bag of little bootstraps,” a data splitting technique that allows for computa¬ 
tionally tractable implementation of the bootstrap for massive datasets. Wang et al (2014) 
study the classical problem of covariance estimation under computational constraints. Yang 
et al (2015) find conditions that ensure a Markov chain Monte Carlo sampler for Bayesian 
linear regression in high dimensions will be both consistent and have fast mixing time. 
Chandrasekaran and Jordan (2013) consider “algorithm weakening” to describe an ordering 
of algorithms in terms of compute time and statistical properties. Horev et al (2015) study 
the tradeoff between statistical detection and computation in an edget detection framework. 

A common approach is to describe compute cost in terms of algorithmic complexity 
(Berthet and Rigollet, 2013; Shender and Lafferty, 2013; Bresler et al, 2014; Montanari, 
2014). Within this framework, two algorithms are equivalent if they have the same worst 
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case complexity, even if one of them is likely to never perform in the worst case regime. While 
computational complexity is a key aspect of understanding and choosing between algorithms, 
when a hner analysis is possible, it is preferred. Chandrasekaran and Jordan (2013) provides 
an example of this hner analyis via the “time-data tradeoff” and our approach is in a similar 
spirit. Specihcally, we consider a practical framework for addressing the tradeoff between 
computational cost and statistical risk. This is in contrast to other works such as Montanari 
(2014), who shows that some statistical procedures can be modihed to provide fast algorithms 
with well understood computational gains but without assessing the degradation in statistical 
risk. 

Our focus for this paper will be on classical statistcal problems including estimating the 
mean and variance for a normal popualation, exponential families, robustness, and regression. 
In Section 2, of this paper we outline the basic framework associated with the statistical risk 
and computational cost tradeoff frontier. Section 3 illustrates this framework in the normal 
population setting and in Section 4 we extend these ideas to general exponential families. We 
briehy consider robust estimates such as the Hodges-Lehmann example in Section 5. Finally, 
we consider extending these ideas to iterative methods for matrix inversion in Section 6. 


2 Analyzing statistical and computational tradeoffs 

Consider the problem of estimating 6 given an i.i.d. sample from the distribution fo. We 
suppose that the parameter is vector valued with 0 G 0 C the model is {fe : 0 G 0} and 
we denote our random sample as Xi,, Xn fe where Xi is X valued for df C M'^. The 
estimate is denoted is 6* = 9{Xi ,..., Xn) and the loss associated with the estimate is i{9, 9). 
In classical statistical estimation theory, one seeks to in some way minimize the risk, the 
expected loss R{9, 9) = E0[f'(0, 9)], be it in terms of minimax optimality, Bayesian optimality, 
or other principles such as unbiasedness or equivariance (Lehmann and Casella, 1998). 
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We will maintain the goal of achieving a small risk but we will add the goal of computing 
the estimate quickly. Formally, one need not have an algorithm to compute the estimator 
6 : df" I— >■ 0 for all values in df” in order to analyze the risk and statistical properties of the 
estimate. However, in our setting we will assume that each estimator comes equipped with 
an algorithm to compute the function of the data. Hence, an estimate, together with its 
algorithm, will have a compute time C{6) G M"*" that denotes the runtime of that algorithm 
on the data Xi,... ,X„. Altogether we now have two quantities, the risk R{6,6) and the 
expected compute time E 0 [C( 0 )]. 

Remark. To keep things more straightforward, the first few examples considered in this 
manuscript will have the property that C{6) does not depend on the data Xi,... ,X„ so that 
the expected compute time does not depend on the parameter. Clearly many algorithms do 
not fit this mold and much of the ideas we discuss apply outside the fixed computational cost 
setting. 

Now, consider a practitioner confronted with a collection of estimators, each with an 
associated algorithm. We denote this collection by {0s}sg5 where S is some index set. 
The collection of estimates may be determined by questions such as: How much storage is 
available? Can all the data be kept in memory or only a subset? How much processing 
power is available? Are there parallel or distributed systems that can be exploited? 

Each of these questions will put different constraints on the collection of estimators that 
delineates what is possible and ready for use. Among the feasible estimators, the practitioner 
must chose an estimator 9^ to use on the data at hand. If the practitioner knows R{6s,9) 
and C{6s) for each estimate and parameter value than they can use this to make an informed 
decision about which estimator to choose that balances risk and computational cost. For 
a given parameter value 6 we can plot the risk and computation time associated with each 
estimator as was done in Figure 1. This hgure illustrates the idea of a computational- 
statistical trade-off with some estimates achieving very low risk, others being very fast. 
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Figure 1: An illustration the risk and computation time trade-off associated with a collection 
of 6 estimators. Estimator A is the fastest estimators but has much higher risk than the 
other estimators while estimator F has the lowest risk but suffers from being quite slow. 
Estimators B or D might be a good choice for a practitioner seeking to achieve a balance 
between time and accuracy. Note that estimator C can be disregarded since both B and 
D are strictly better than C in terms of both risk and computation time. The dashed line 
depicts the theoretical risk-computation frontier. 


and some providing a balance between the two. In the next section we will examine this 
framework in the setting estimating the mean and variance for a sample from a normal 
distribution before investigating exponential families in general in Section 4. 


3 Normal example 

To concretely illustrate the computation-statistical tradeoffs we will consider the simple 
example of estimating the population mean and variance from a sample of independent and 
identically distributed normal random variables. We consider two computational constraints 
that dehne the collections of algorithms that are available for estimation. The hrst setting 
explores a singly indexed set of estimators for a near zero resource streaming setting. The 
second setting generalizes the hrst by allowing various ways to divide the data between 
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different aspects of the estimation. 


3.1 Standard inference 

Suppose that we observe Xi,..., ~ cr^) and we want to estimate 6 = {fi, cr^) with 

our loss being square error loss, i{9,9) = \\9 — 6*||| = (/i — /i)^ + — cr^)^. Our analysis 

allows for loss functions that are other linear combinations of the risks for /i and however 
for ease of illustration we focus on this loss function. 

Before delving into various computationally constrained estimates, consider the standard 
maximum likelihood estimates (MLE) of the mean and variance for a normal. 

/iMLE = X = -Vx, and = = -y2x^-X\ (1) 

i=l i=l 

To compute this estimate we require two operations or looks at each data point to compute 
the sufficient statistics X and X'^ —that is we need to temporarily store each data point in 
order to perform a local operation before updating the sufficient statistics. This paradigm of 
“looking” at data point and performing an operation with them dehnes our computation cost 
in this problem. The total computational cost for the MLE is thus 2n} The risk associated 
with this estimate is 


E [{fiMLE - + E 


2cr^ 
n n 


3.2 Streaming setting 


We now consider what we call the streaming setting, a near zero-resource setting where local 

storage is extremely limited, allowing us access to each data point exactly once. This means 

^One might claim the cost is 2n + C where the C represents the additional computation needed to 
complete the computations in Eq. (1). We omit this since this additional time is unchanged for all estimates 
and algorithms in this section. 
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that we can’t compute the full MLE which requires two looks at each point to compute the 
hrst and second moments. We consider estimators with index set = — 1} where 

for each s G S' the estimate Og is computed by hrst computing the moment sums 

s n 

i=l 2=S+1 

where n is the total sample size. We could have allowed the sums to be non-sequential but 
provided s samples are used for Mi^g and n — s for M 2 ,s the estimates are the same in risk 
and computation time due to the fact that the samples are exchangeable. We assume that 
updating either Mi^g or M 2 ,s has the same cost so that the cost of computing these statistics 
is exactly n as each data point is accessed only once and again we can simply keep track of 
two sufficient statistics. 

Remark. We remark here that assigning a cost ofn to each of these estimators is, of course, 
an abstraction but a useful one nonetheless. In reality the time to compute an estimate will 
be some function of n, s, and will also depend on the implementation in high-level and low- 
level languages down to the structure of the hardware being used. However, for a large range 
of n and s we believe it is reasonable to assume that the cost to compute Mi^g and M 2 ^s will 
be linear in s and n — s respectively. Boiling this down to the assumption that the cost is n 
makes the following analysis guite clear but generalizing slightly does not change the overall 
flavor of the result as we discuss at the end of Section f. As we go forward, this paper we 
will make similar assumptions about the computational cost that allow for numerical analysis 
but do not impact the overall framework. 

From these statistics we can dehne the unbiased streaming estimates for the mean and 
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variance as as 


P'S 


(J, 


-Ml, 5 , and 
s 


s 

(s — l)(n — s) 



We note that nnlike in the example of the MLE where estimates of the mean and the variance 
are independent, for the streaming setting the estimates of the mean and the second non¬ 
central moment are independent while the estimates of the mean and the variance are not. 
This will be explored in greater detail in Section 4 when discussing estimates of natural 
versus mean value parameters for general exponential families. 

The risks of the mean and variance, both under quadratic loss, in the streaming setting 
are given by 


R-(/is, /i) - 


a 


^snjjpa'^ + 2 ((s — l)s -|- n)a^ 
(s — iy{n — s) 


Alternatively, we could consider the maximum likelihood estimate given the observed 
statistics Mi, 5 , M 2 ,^ and t: 

1 1 /I \^ 

pMLE,s = and s = - ^2,3 - 7 ^ 1 ,^ • (2) 

s ' n — s \t J 

As n gets larges and t/n tends to a constant p G (0,1), these estimates are essentially 
equivalent and will both have the same asymptotic risk. Indeed, asymptotically we have 
that the risk is approximately 

cr^ (4/i^ -I- 2pa‘^ — p -I- 1) 

?7,p(l — p) 
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and one can then verify that in order to minimize this asymptotic risk as a function of (/i, a) 
one should select p = . ^ — , Intuitively, this says that if the signal to noise ratio 

is very small, than p is close to zero with most effort put on computing the second moment 
M 2 ^s- On the other hand if and /r 3> 0 then p will be close 1/2. Finally, if 

and /i ~ 0 then p will be close to 0. 

3.3 Risk/computation frontier 

The streaming setting is a special case of a slightly more general collection of estimators. In 
particular, we allow up to two looks and operations for each point with some points possibly 
only used for the computation of one of the two statistic, others may be used for both, while 
still other samples may be ignored completely. The collection of estimates is indexed by 
pairs of sets (S'!, S 2 ) where each Si is a non-empty subset of {1 ,..., n} so the index set is 

^ = {(^ 1 , ^ 2 ) : C [n], ^2 C [n], |^i| > 0, |^ 2 | > 0} . (3) 

For s = {Si, S 2 ) G S' we hrst compute 

= M2,s = Y.XI (4) 

ieSi ie52 

and we let rii = [S'! \ S' 2 |, n 2 = IS '2 \ S'!] and ni 2 = [S'! fl S' 2 |. As in the previous setting, 
the particular sets Si and S 2 only impact the risk and computation time of the estimates in 
terms of rii, n 2 and ni 2 . The computational cost we assign to this procedure is 

C = TLi + n2 -\- 2 ^ 12 , 


since ni 2 indicates the number of samples used for both statistics while ni and n 2 indicate 
the number of samples used only for computing Mi^s and M 2 ^s, respectively. Our maximum- 



likelihood-like estimates are then 


1 


f^S 


rii + ni2 


Mi^s and = 


n2 + ni2 


M2,s — fl'l 


To get the risk of these estimators we provide a glimpse of the general Fisher information 
result from the following section. 


Proposition 1. Let Xi,...,Xn ~ normal(/i, cr^). For s = { 81 , 82 ) G 8 from Eq. (3), let 
Mi^s, M 2 ^s, ni, n 2 andni 2 he as in Eq. (4). Forn large, the covariance for estimates {fis,^^) 
of {fi, a) is approximately 


/ 

V 


ni+n\2 

2n2/^cr^ 

{ni+ni2){ni2+n2) 


2n2lJ,E 

(ni+ni2)(ni2+n2) 

2(ni+ni2)i7'^+4(ni+n2)M^cr^ 

{ni+ni2)(ni2+n2) 


(5) 


and so the overall risk of the estimates is approximately 


^2{ni+ni2)a^ + 4,{ni+n2)tx^a‘^ 
ni + ni2 (ni -h nia)(ni2 ^ 2 ) 


For a given jx and a we can compute ni, 77,2 and ni 2 that minimize the total risk for any 
given computational cost C satisfying 2 < C < 2n. Several scenario, in terms of the signal 


SNR 


ni/n 77 . 2/77 

7712/77 

= 0 

> 1/2 


2 Change points 

= 0 

= 1/2 

Non-unique Solution 

= 0 

< 1/2 


2 Change points 

> 1 

> 0 

0 

0 

C/2 

e (0, 1 ) 

> 1/2 

0 

ac 

{C-ac)/2 

€ (0,1) 

= 1/2 

0 

0 

C/2 

e (0,1) 

< 1/2 

be 

0 

{C - bc)/2 


Table 1: The split of the data points between the three possible usages (mean only, non 
central second moment only, or both) is provided in columns 3 through 5. Both ac and be 
approach to 0 as C* —)■ 2 n. See Figure 1 for a detailed look at particular parameter values in 
these different regimes. 
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(a) /i = 1/10 and a = 1/2 




1.0 1.5 

computation 



(c) /i = 1/2 and cr = 10 



(d) = 0 and <7 = 1 



Figure 2: In each row, the left panel illustrates the optimal proportions rii/n in red, n 2 /n 
in green and ni 2 /n in blue as a function of the computational constraints given along the 
horizontal axis. The computational constraint is given as a proportion of the number of 
samples n, so that on the horizontal axis 1 indicates a cost of n and 2 indicates a cost of 2n, 
where the standard estimate is always optimal. The right panel illustrates the risk using the 
given estimate with the given proportions. Each of the rows corresponds to a different row 
in Table 3.3. 

to noise ratio SNR = /r/cx and variance a are presented in Table 3.3. In particular, we note 
that if SNR = 1/a/2 we always choose to use less of the data and essentially construct an 
MLE. Otherwise, there are different optimal ni, n 2 and ni 2 based on whether SNR is zero 
or between 0 and 1 and on whether cx^ is greater than, equal to, or less than 1/2. To further 
illustrate the different regimes. Figure 2 shows the optimal portions of the samples used and 
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the associated risk as the computational constraint varies. In the next section we explore 
the tradeoff of risk and compute time for the more general setting of exponential families. 


4 Exponential family 

A p parameter exponential family is a family of distributions {fe}e£e oii a space X each 
with a density with respect to an appropriate carrying measure p which can be written in 
the form 


fe{x) = h(a;) exp{ 6 *^t(x) — tl'( 6 *)} 


where h is a function from X to M"*", 0 G 0 C is the natural parameter for the model, 
t = (ti,t 2 , ■ ■ ■ ,tpy is a real-valued p-dimensional sufficient statistic, ie. 4 : A i—)■ M, and 
\1'(6') = log h{x) exp{6'^T{x)}dx, (Bickel and Doksum, 1976). 

For a random sample Xi ,..., ~ fe, the statistic T = t{^i) ^ is sufficient and 


the maximum likelihood estimate 6 is found by solving for 6 in the equation T/n = E 0 [t(X)]. 


As in the normal example, for our purposes it is convenient to reparameterize the model 
using the mean value parameterization with parameters r = t{9) = E 0 [T(X)] G where 
the MLE for r is simply T/n. Note that the map t : 6 ^ t{6) is invertible so that r is a 
proper reparametrization of the model. 

Using the same ideas as in our normal example, we can construct alternative statistics by 
computing each component of the sufficient statistic on a subset of the full data set. Formally, 
our estimates will be indexed by subsets Si,..., Sp C. [n] so the index set is 5 = For 

S = (S'!,..., Sp) G S, we dehne the statistics 



^Again, the actual sets are not critical but only their cardinality and the cardinality of the pairwise 
interections. 
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and the estimate for Tk is fs,k = |>S'a:| ^Ts^k- The estimate for 9 is dehned analagously, 
Os = T~^{'Ts)- 

The covariance for fs can be written in terms of the covarariance for T{X) which is 
the inverse of the Fisher information matrix for r, and the cardinality of the sets Si,Sp 
and their pairwise intersections. Specihcally, the variance terms are Var(|S'fc|“^T 5 fc) = 
^ and the covariance terms are Cov(|S'A;|“^Ts'^fc, • 

Note that the analog to the streaming case for a p-parameter exponential family is where 
Sk C) Si = 0 for all /c ^ / G [p] and in this case the covariance matrix for fs will be diagonal 
as isevident by the fact that each statistic is computed on an independent sample. However 
the covariance for 9s will usually not be diagonal, as can be verihed in the case of the normal 
example. 

For a given cost level c and estimate 9 = 9 {t) = we can hnd the best subsets 

by solving an appropriate optimization problem. Frequently the compute times for each 
statistic will be different and so we can dehne the computational cost for each statistic in 
terms of the cost to compute tk{x) and the set sizes |S'fc| for /c G [p\. Specihcally, we denote by 
Cfc the runtime to perform the operation that computes tk{xi) and adds it to the partial sum. 
The risk for estimating 9 is R{9,9s) = tr ^0(r)Cov(f5)0(r)^j but if we want to estimate 
another parameter p = p{t) then the risk is given by tr(p(r)Cov(f 5 )? 7 (r)^) where ? 7 (r) is 
the gradient of p with respect to r. Finally, some parameters may be more important to 
estimate than others and so we allow for the scaling of the covariance by a non-negative 
diagonal matrix Q which indicates the relative importance of the different components of 
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the parameter. Together this yields the optimization problem 


mm 

se-s 




(6) 


P 


such that 



if A; = / 


(7) 


where Ti^i 


\SknSi\r\T)ki 


( 8 ) 


otherwise. 


In Section 3 we were able to solve this problem in the case of the normal distribution 
and estimation of the mean and variance parameters. Solving this problem for certain other 
distributions such as the multivariate normal is also relatively straightforward. In general, 
computing the Fisher information matrix and its inverse for the mean value parameterization 
of an exponential family is a nontrivial task. Additionally, the functions t]{t) are generally 
difficult to compute especially in high dimensions. For example, Montanari (2014) shows 
that for certain classes of graphical models hnding computing the map from the mean-value 
to the natural parameter space is an NP-hard problem in the dimension of the parameter 
space. Nonetheless, the formulation of this optimization problem offers another step towards 
an understading the frontier of the risk-runtime tradeoff. In the next two sections we will 
deviate slightly and consider estimation procedures with slightly less well understood 


5 Hodges-Lehmann estimator 


As another investigation into possible tradeoffs between computation time and statistical 


risk we consider the Hodges-Lehmann (HL) estimate for the mean of a distribution. For a 
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sample of size n, this estimates is defined as 


9hl = median 


Xi + X,- 


: e [n\ 


(9) 


This estimate is known to be very robust and often outperforms both the mean and the 
median in terms of statistical risk for data arising from distributions with contamination. In 
this example we consider the contaminated distribution 


(1 - aMO, 1 ) + 0X3(4) 


where 73(4) is a central t distribution with three degrees of freedom and then scaled by a 
factor of four. Each mixture component has mean zero however approximately ten percent 
of any sample will be from a contaminated distribution with much heavier tails and higher 
variance. The proportion of data arising from the t distribution is the contamination level 

a. 

The computation cost of HL estimator can be decomposed into two parts, the time to 
compute the pairwise sums, which requires n{n + 1) looks at the data, and the time to 
compute the median. The time to compute the median is dominated by the number of 
comparisons needs and will in practice depend on the data. For the simulations below we add 
the the expected number of comparisons, as determined in Knuth (1972) for the QuickSelect 
algorithm, to the computation cost. Asymptotically, the expected time to compute the 
median of n samples is approximately 3.38n, while in comparison, the mean requires only n 
operations as described above. We considered a variety of estimates in order to reduce the 
computation time of the HL estimate and compare them in Figure 3: 

subset We hrst select a subset of the data of m and then sample without replacement c/2 
from all pairs in this subset. 

sample We sample with replacement c/2 pairs from all (”) possible pairs from the entire 
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data set. 


sequential At cost c we use the c/2 pairs (Xi, X2), (X3, X4),..., (Xc_i, Xc) 

For each of these estimates we compute the mean for each of the selected pairs and then 
compute the median for that set. 

We used a sample size of n = 2000 and for the subset HL estimate we used m = [a/ 2000J . 
For the mean we simply considered the sample mean of the hrst c points for c G [n] and for 
the three HL estimates we used even costs from 2 to n. We simulated 5 x 10"^ replicates for 
each contamination level to estimate the risk associated with each estimator at each cost. 
Overall, the best estimates were either the sample mean or the sequential HL, at least up 
to the feasible costs for those methods. Choosing between the sample mean and the HL 
sequential depends on the cost restraints as well as the contamination level as shown in 
Figure 3. Overall, this example illustrates the intricacies of the computational-statistical 
trade-off frontier for even relatively straightforward settings. 


6 Matrix inversion 

One of the most important linear algebra operations for statistical analysis is the matrix 
inverse. It is also frequently the bottleneck of statistical procedures as the operation is 
naively of complexity order 0{'n?) (Gauss-Jordan elimination) and optimally, if impractically, 
of order (Williams, 2012). All practical algorithms for matrix inversion are based 

on iterative approaches. Each iteration can naturally dehne a computational cost metric 
that allows us to evaluate the statistical risk versus computational cost tradeoff within an 
algorithm. Since all iterative methods are meant to converge to the same numerical value 
this also suggests a method for comparing across algorithms when explicit costs cannot be 
dehned. 
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5% 10% 20% 



estimator — mean - - HL subset ■ ■ ■ HL sample HL sequential 


Figure 3: Computation and risk on the log-log scale for the sample mean and variations on the 
Hodges-Lehmann estimators. The computation is determined by the number of operations 
to compute the pairwise means and the number of comparisons to compute the median. The 
three panels correspond to different levels of contamination of 5%, 10% and 20%. At very 
low costs and very high costs the HL sequential estimator is best but at moder costs, between 
100 and 1000, the level of contamination can impact whether the HL sequential estimator 
or the mean estimator is to be preferred. 

In this section we consider the problem of hnding the least squares estimator in a standard 
linear regression 

Y = Xp + e 

where the columns of X are correlated. Each of the p columns of X represents an attribute 
of an individual. The solution /3 = {X^X)~^X^Y is well known and requires the inversion 
of the Gram matrix S = X^X. We explore two iterative methods for matrix inversion. The 
hrst is a naive Newton-Raphson (NR) algorithm that inverts the matrix A via the iterative 
procedure = 2A'^\ — A'^\AA'^\. It is clear that by letting /c —)■ cx) we get A'^^ —)■ A~^ 
such that A~^A = I for I the identity matrix. 

The second method builds on the power method for eigenvector and eigenvalue approx¬ 
imation. It is well known that the eigenvector associated with the largest eigenvalue of a 
symmetric matrix A can be computed via the iteration where || ■ II 2 is 

the squared L 2 norm. To compute the eigenvector associated with the second eigenvalue one 
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first computes and then performs the above iteration replacing A with A — A 

similar expression is available for smaller eigenvalnes. We consider several stopping criteria 
for for this approach. First we consider stopping the compntation of the first eigenvector 
after k steps, then compnte the second eigenvector based on A — also stopping after 

k, and so on. A second approach considers stopping the first iteration after k steps, the 
second after k — 1 steps, until the pth after k — p steps. 

For the purposes of exposition we consider a matrix X = where the entries 

Zij ~ A/'(0,1), C = (1 — p)/ + llV) the compound symmetry correlation matrix, and 
D is a diagonal matrix with entries decreasing uniformly from 4 to 2. As p ^ 1 the 
matrix approaches rank deficiency which suggests that for larger values of p algorithms that 
approximate the inverse via lower rank matrices are likely to perform as well as full rank 
inversions. In this simulation we consider p = 10. The inversion via Newton-Raphson has on 
the order of 10^ steps, but in practice no more than 20 steps are needed for numeric conversion 
of the algorithm. The power method approaches have the same algorithmic complexity but 
converge even faster in practice. Throughout, the true value of /3 is a uniformly separated 
sequence from —1 to 1 and we consider the risk of estimating /3 under quadratic loss. For 
independent and identically distributed noise e ~ A/'(0,1) we know that the risk of estimating 
(3 is given by 'Yhi diag and so we use this exact value to conhrm that an inversion 

method has converged. 

We simulate 10,000 datasets in order to estimate the risk for (3 across three values of p 
between 0.01 and 0.88. In Figure 4 we see the outcomes of the experiment for both types of 
inversion procedures. For the power method algorithm the computation cost is determined by 
the total number of iterations of a single run. For example, the linear method costs — ( 2 ) 
iterations, while the approach that stops every iteration after k steps costs kp. Each iteration 
of NR is assigned a cost of 20 since each one requires two matrix multiplications which involve 
20 vector multiplications and each iteration of the power method is a vector multiplication. 
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p^O.01 


p = 0.38 





computation 


Figure 4: Three plots of simulation results comparing computational cost (as measured by 
the number of iterations of an algorithm) versus the risk under quadratic loss of estimating 
the coefficients in a linear regression. Each plot represents a different level of dependence 
among the columns of the design matrix X. 


First it is evident that the risk is reduced non-linearly with increases in computation costs 
for both methods. The NR algorithm converges to the risk of the fully inverted matrix 
faster than the two power-method approaches. However it does so at the expense of poor 
performance before full convergence. In particular, both NR and power-method approaches 
are initialized naively. While NR has an initial risk (not shown on the plots due to scale) 
hundreds of times greater than the lowest risk, both power methods are within 30% of the 
lowest risk after only a few iterations. Throughout the plots we can see that NR is very 
dependent on the value of p as that determines how well S is approximated by a low rank 
matrix. The two power-methods appear agnostic to the value of p with the exception of 
how smoothly they approach the lowest risk. This can be explained by the interdependence 
between iterations of the power method - that is, a slight improvement in the estimation 
of the hrst eigenvector (prior to the convergence of the power method iterations) does not 
guarantee an improvement in the estimation of the second eigenvector. 
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7 Conclusions 


This article proposed an interpretable framework for the tradeoff between compntational 
cost and statistical risk. Onr approach introdnced exact compntational cost into the analysis 
of statistical methods. This is hrst illnstrated via the classical example of estimating the 
mean and variance of a sample of normal random variables. In this setting, we snggest 
that the nse of a single data point for the npdate of a snfficient statistic shonld incur a 
cost of one. This allows us to compute exact and asymptotic risks associated with mean 
and variance estimation under a computational constraint. We extended this framework 
to general exponential families in Section 4. We further illustrated our framework in the 
context of robust estimators (Section 5) and iterative procedures (Section 6). 

We note, as we did in Remark 3.2, that we have made simplifying assumptions about 
the computational costs and runtimes of various procedures. These assumptions allow for 
the subsequent analysis and we believe are still helpful in guiding the choice of estimators. 
Sometimes a more detailed and hne-grainded analysis may be desired that does not employ 
these abstractions. In this case we believe that the practitioner could use benchmarking tools 
to precisely measure the runtimes of various aspects of their procedures which, together with 
algorithmic analysis, can be used to employ our framework in choosing the best procedure 
for the problem at hand. 

Beyond the applications presented in this article our approach can be employed whenever 
computational constraints are present. In the context of experimental design, this framework 
can inform the number of observations or the number of subjects needed for a study. For 
high throughput data it can assist in deciding on a sampling mechanism when all data 
cannot be read into memory. The iterative procedures section suggests the development of 
analogues to standard methodology (such as linear regression and spectral clustering) that do 
not necessitate numerical convergence of intermediary steps but that still preserve desirable 
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statistical properties. 


References 

Agarwal A, Chapelle O, Dudik M, Langford J (2014) A reliable effective terascale linear 
learning system. J Mach Learn Res 15:1111-1133 

Berthet Q, Rigollet P (2013) Computational lower bounds for sparse pea. arXiv preprint 
arXiv: 13040828 

Bickel PJ, Doksum KA (1976) Mathematical statistics. Holden-Day, Inc., San Francisco, 
Calif.-Diisseldorf-Johannesburg, basic ideas and selected topics, Holden-Day Series in 
Probability and Statistics 

Bottou L (2012) Large-scale machine learning with stochastic gradient descent. In: Statistical 
learning and data science, Comput. Sci. Data Anal. Ser., CRC Press, Boca Raton, FL, pp 
17-25 

Bresler G, Gamarnik D, Shah D (2014) Hardness of parameter estimation in graphical mod¬ 
els. arXiv preprint arXiv: 14093836 

Ghandrasekaran V, Jordan MI (2013) Gomputational and statistical tradeoffs via convex 
relaxation. Proc Natl Acad Sci USA 110(13):E1181-E1190, DOI10.1073/pnas.1302293110, 
URL http: //dx. doi . org/10.1073/pnas. 1302293110 

Horev I, Nadler B, Arias-Gastro E, Galun M, Basri R (2015) Detection of long edges on a 
computational budget: a sublinear approach. SIAM J Imaging Sci 8(l):458-483 

Kleiner A, Talwalkar A, Sarkar P, Jordan MI (2014) A scalable bootstrap for massive data. 
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 


20 


Knuth DE (1972) Mathematical analysis of algorithms. In; Information processing 71 (Proc. 
IFIP Congress, Ljubljana, 1971), Vol. 1: Foundations and systems, North-Holland, Ams¬ 
terdam, pp 19-27 

Langford J, Li L, Zhang T (2009) Sparse online learning via truncated gradient. J Mach 
Learn Res 10:777-801 

Lehmann EL, Casella G (1998) Theory of point estimation, 2nd edn. Springer Texts in 
Statistics, Springer-Verlag, New York 

Montanari A (2014) Computational implications of reducing data to sufficient statistics. 
arXiv preprint arXiv: 14093821 

Scott SL, Blocker AW, Bonassi FV, Chipman HA, George El, McCulloch RE (2013) Bayes 
and big data: The consensus monte carlo algorithm. In; EFaBBayes 250 conference, vol 16 

Shender D, Lafferty J (2013) Computation-risk tradeoffs for covariance-thresholded regres¬ 
sion. In; Proceedings of The 30th International Conference on Machine Learning, pp 
756-764 

Toulis P, Airoldi EM (2014) Implicit stochastic gradient descent for principled estimation 
with large datasets. arXiv preprint arXiv: 14082923 

Wainwright MJ, Jordan Ml (2008) Graphical models, exponential families, and variational 
inference. Foundations and Trends© in Machine Learning l(l-2):l-305 

Wang T, Berthet Q, Samworth RJ (2014) Statistical and computational trade-offs in esti¬ 
mation of sparse principal components. arXiv preprint arXiv: 14085369 

Williams VV (2012) Multiplying matrices faster than coppersmith-winograd. In: In Proc. 
44th ACM Symposium on Theory of Computation, Citeseer 

Yang Y, Wainwright MJ, Jordan Ml (2015) On the computational complexity of high¬ 
dimensional bayesian variable selection. arXiv preprint arXiv: 150507925 


21 



